{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "%load_ext autoreload\n",
    "%autoreload 2"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# fastText introduction\n",
    "\n",
    "fastText model is a simple and fast baseline model for text classification. It learns about features (n-grams)  embedding, which are averaged to form the hidden vector representation of a document. Its accuracy is on par with deep learning classifiers, but is orders of magnititute faster for training and evaluation. The fastYext model provides another baseline model for text classification besides bags of words model in autogluon.   \n",
    "\n",
    "To start, import AutoGluon's TabularPredictor:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import logging\n",
    "logging.basicConfig(format='%(asctime)s: [%(funcName)s] %(message)s',\n",
    "                   level=logging.INFO)\n",
    "logger = logging.getLogger(__name__)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "from sklearn.metrics import accuracy_score\n",
    "\n",
    "from autogluon.tabular import TabularDataset, TabularPredictor\n",
    "from autogluon.features.generators import AutoMLPipelineFeatureGenerator\n",
    "from autogluon.tabular.models.fasttext.fasttext_model import FastTextModel"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# load data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Load training data from a CSV file into an AutoGluon Dataset object. This object is essentially equivalent to a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and the same methods can be applied to both."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/local/home/qiangson/workspace/autosense/code/conda-env-0820/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
      "  and should_run_async(code)\n",
      "2020-12-09 22:46:46,415: [load] Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv | Columns = 15 / 15 | Rows = 39073 -> 39073\n",
      "2020-12-09 22:46:46,527: [load] Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   age   workclass  fnlwgt   education  education-num       marital-status  \\\n",
      "0   25     Private  178478   Bachelors             13        Never-married   \n",
      "1   23   State-gov   61743     5th-6th              3        Never-married   \n",
      "2   46     Private  376789     HS-grad              9        Never-married   \n",
      "3   55           ?  200235     HS-grad              9   Married-civ-spouse   \n",
      "4   36     Private  224541     7th-8th              4   Married-civ-spouse   \n",
      "\n",
      "           occupation    relationship    race      sex  capital-gain  \\\n",
      "0        Tech-support       Own-child   White   Female             0   \n",
      "1    Transport-moving   Not-in-family   White     Male             0   \n",
      "2       Other-service   Not-in-family   White     Male             0   \n",
      "3                   ?         Husband   White     Male             0   \n",
      "4   Handlers-cleaners         Husband   White     Male             0   \n",
      "\n",
      "   capital-loss  hours-per-week  native-country  class  \n",
      "0             0              40   United-States  <=50K  \n",
      "1             0              35   United-States  <=50K  \n",
      "2             0              15   United-States  <=50K  \n",
      "3             0              50   United-States   >50K  \n",
      "4             0              40     El-Salvador  <=50K  \n"
     ]
    }
   ],
   "source": [
    "train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')\n",
    "train_data['class'] = train_data['class'].str.strip()\n",
    "test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame\n",
    "print(train_data.head())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that we loaded data from a CSV file stored in the cloud (AWS s3 bucket), but you can you specify a local file-path instead if you have already downloaded the CSV file to your own machine (e.g., using `wget`).\n",
    "Each row in the table `train_data` corresponds to a single training example. In this particular dataset, each row corresponds to an individual person, and the columns contain various characteristics reported during a census.\n",
    "\n",
    "Let's first use these features to predict whether the person's income exceeds $50,000 or not, which is recorded in the `class` column of this table."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/local/home/qiangson/workspace/autosense/code/conda-env-0820/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
      "  and should_run_async(code)\n",
      "2020-12-09 22:46:46,565: [_init_num_threads] Note: detected 96 virtual cores but NumExpr set to maximum of 64, check \"NUMEXPR_MAX_THREADS\" environment variable.\n",
      "2020-12-09 22:46:46,566: [_init_num_threads] Note: NumExpr detected 96 cores but \"NUMEXPR_MAX_THREADS\" not set, so enforcing safe limit of 8.\n",
      "2020-12-09 22:46:46,566: [_init_num_threads] NumExpr defaulting to 8 threads.\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Summary of class variable: \n",
      " count     39073\n",
      "unique        2\n",
      "top       <=50K\n",
      "freq      29704\n",
      "Name: class, dtype: object\n"
     ]
    }
   ],
   "source": [
    "label = 'class'\n",
    "print(\"Summary of class variable: \\n\", train_data[label].describe())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Use fastText model in TabularPredictor"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's add some mock text fields to the original data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/local/home/qiangson/workspace/autosense/code/conda-env-0820/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
      "  and should_run_async(code)\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "sample text column values\n",
      "[' Some-college,  Married-civ-spouse,  Exec-managerial,  Wife,  Private,  United-States,  Female,  White.', ' 10th,  Married-civ-spouse,  Other-service,  Wife,  Private,  United-States,  Female,  White.', ' Some-college,  Married-civ-spouse,  Craft-repair,  Husband,  Private,  United-States,  Male,  White.', ' HS-grad,  Never-married,  Sales,  Not-in-family,  Private,  El-Salvador,  Male,  White.', ' Bachelors,  Married-civ-spouse,  Exec-managerial,  Husband,  Private,  United-States,  Male,  White.']\n"
     ]
    }
   ],
   "source": [
    "train_data['text'] = (\n",
    "    train_data[['education', 'marital-status', 'occupation', 'relationship', \n",
    "                'workclass', 'native-country',  'sex', 'race']]\n",
    "    .apply(lambda r: ', '.join(r.values) + '.', axis=1)\n",
    ")\n",
    "\n",
    "\n",
    "test_data['text'] = (\n",
    "    test_data[['education', 'marital-status', 'occupation', 'relationship',\n",
    "               'workclass', 'native-country',  'sex', 'race']]\n",
    "    .apply(lambda r: ', '.join(r.values) + '.', axis=1)\n",
    ")\n",
    "print('sample text column values')\n",
    "print(train_data['text'].sample(5).to_list())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now, we can specify FastTextModel as one custom model so that you can leverage the ensemble/stacking feature in AutoGluon:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2020-12-09 22:46:46,944: [_fit] Beginning AutoGluon training ...\n",
      "2020-12-09 22:46:46,944: [_fit] AutoGluon will save models to AutogluonModels/ag-test/\n",
      "2020-12-09 22:46:46,944: [_fit] AutoGluon Version:  0.0.15b20201112\n",
      "2020-12-09 22:46:46,945: [_fit] Train Data Rows:    39073\n",
      "2020-12-09 22:46:46,945: [_fit] Train Data Columns: 15\n",
      "2020-12-09 22:46:46,946: [_fit] Preprocessing data ...\n",
      "2020-12-09 22:46:46,962: [infer_problem_type] AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n",
      "2020-12-09 22:46:46,962: [infer_problem_type] \t2 unique label values:  ['<=50K', '>50K']\n",
      "2020-12-09 22:46:46,962: [infer_problem_type] \tIf 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n",
      "2020-12-09 22:46:46,980: [__init__] Selected class <--> label mapping:  class 1 = >50K, class 0 = <=50K\n",
      "2020-12-09 22:46:46,983: [general_data_processing] Using Feature Generators to preprocess the data ...\n",
      "2020-12-09 22:46:46,988: [_log] Fitting AutoMLPipelineFeatureGenerator...\n",
      "2020-12-09 22:46:47,009: [_log] \tAvailable Memory:                    257884.49 MB\n",
      "2020-12-09 22:46:47,009: [_log] \tTrain Data (Original)  Memory Usage: 28.97 MB (0.0% of available memory)\n",
      "2020-12-09 22:46:47,011: [_log] \tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
      "2020-12-09 22:46:47,104: [_log] \tStage 1 Generators:\n",
      "2020-12-09 22:46:47,105: [_log] \t\tFitting AsTypeFeatureGenerator...\n",
      "2020-12-09 22:46:47,115: [_log] \tStage 2 Generators:\n",
      "2020-12-09 22:46:47,116: [_log] \t\tFitting FillNaFeatureGenerator...\n",
      "2020-12-09 22:46:47,136: [_log] \tStage 3 Generators:\n",
      "2020-12-09 22:46:47,137: [_log] \t\tFitting IdentityFeatureGenerator...\n",
      "2020-12-09 22:46:47,138: [_log] \t\tFitting IdentityFeatureGenerator...\n",
      "2020-12-09 22:46:47,140: [_log] \t\t\tFitting RenameFeatureGenerator...\n",
      "2020-12-09 22:46:47,143: [_log] \t\tFitting CategoryFeatureGenerator...\n",
      "2020-12-09 22:46:47,187: [_log] \t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n",
      "2020-12-09 22:46:47,195: [_log] \t\tFitting TextSpecialFeatureGenerator...\n",
      "2020-12-09 22:46:50,585: [_log] \t\t\tFitting BinnedFeatureGenerator...\n",
      "2020-12-09 22:46:56,027: [_log] \t\t\tFitting DropDuplicatesFeatureGenerator...\n",
      "2020-12-09 22:47:04,787: [_log] \t\tFitting TextNgramFeatureGenerator...\n",
      "2020-12-09 22:47:04,790: [_log] \t\t\tFitting CountVectorizer for text features: ['text']\n",
      "2020-12-09 22:47:05,097: [_log] \t\t\tCountVectorizer fit with vocabulary size = 936\n",
      "2020-12-09 22:47:06,451: [_log] \tStage 4 Generators:\n",
      "2020-12-09 22:47:06,454: [_log] \t\tFitting DropUniqueFeatureGenerator...\n",
      "2020-12-09 22:47:06,853: [_log] \tTypes of features in original data (raw dtype, special dtypes):\n",
      "2020-12-09 22:47:06,854: [print_feature_metadata_full] \t\t('int', [])          : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
      "2020-12-09 22:47:06,854: [print_feature_metadata_full] \t\t('object', [])       : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]\n",
      "2020-12-09 22:47:06,854: [print_feature_metadata_full] \t\t('object', ['text']) : 1 | ['text']\n",
      "2020-12-09 22:47:06,862: [_log] \tTypes of features in processed data (raw dtype, special dtypes):\n",
      "2020-12-09 22:47:06,869: [print_feature_metadata_full] \t\t('category', [])                    :   8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]\n",
      "2020-12-09 22:47:06,869: [print_feature_metadata_full] \t\t('category', ['text_as_category'])  :   1 | ['text']\n",
      "2020-12-09 22:47:06,870: [print_feature_metadata_full] \t\t('int', [])                         :   6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]\n",
      "2020-12-09 22:47:06,870: [print_feature_metadata_full] \t\t('int', ['binned', 'text_special']) :  12 | ['text.char_count', 'text.capital_ratio', 'text.lower_ratio', 'text.digit_ratio', 'text.special_ratio', ...]\n",
      "2020-12-09 22:47:06,870: [print_feature_metadata_full] \t\t('int', ['text_ngram'])             : 937 | ['__nlp__.10th', '__nlp__.10th divorced', '__nlp__.10th married', '__nlp__.10th married civ', '__nlp__.10th never', ...]\n",
      "2020-12-09 22:47:06,871: [print_feature_metadata_full] \t\t('object', ['text'])                :   1 | ['text_raw_text']\n",
      "2020-12-09 22:47:06,871: [_log] \t19.9s = Fit runtime\n",
      "2020-12-09 22:47:06,871: [_log] \t15 features in original data used to generate 965 features in processed data.\n",
      "2020-12-09 22:47:06,910: [_log] \tTrain Data (Processed) Memory Usage: 45.59 MB (0.0% of available memory)\n",
      "2020-12-09 22:47:06,929: [_fit] Data preprocessing and feature engineering runtime = 19.98s ...\n",
      "2020-12-09 22:47:06,929: [__init__] AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
      "2020-12-09 22:47:06,930: [__init__] \tTo change this, specify the eval_metric argument of fit()\n",
      "2020-12-09 22:47:06,930: [__init__] AutoGluon will early stop models using evaluation metric: 'accuracy'\n",
      "2020-12-09 22:47:07,220: [get_preset_models] Custom Model Type Detected: <class 'autogluon.tabular.models.fasttext.fasttext_model.FastTextModel'>\n",
      "2020-12-09 22:47:07,222: [_train_and_save] Fitting model: RandomForest ...\n",
      "2020-12-09 22:47:10,509: [_add_model] \t0.8336\t = Validation accuracy score\n",
      "2020-12-09 22:47:10,510: [_add_model] \t2.5s\t = Training runtime\n",
      "2020-12-09 22:47:10,510: [_add_model] \t0.13s\t = Validation runtime\n",
      "2020-12-09 22:47:10,537: [_train_and_save] Fitting model: FastTextModel ...\n",
      "2020-12-09 22:49:19,766: [_add_model] \t0.7796\t = Validation accuracy score\n",
      "2020-12-09 22:49:19,767: [_add_model] \t128.7s\t = Training runtime\n",
      "2020-12-09 22:49:19,767: [_add_model] \t0.37s\t = Validation runtime\n",
      "2020-12-09 22:49:21,085: [_train_and_save] Fitting model: WeightedEnsemble_L1 ...\n",
      "2020-12-09 22:49:21,323: [_add_model] \t0.84\t = Validation accuracy score\n",
      "2020-12-09 22:49:21,323: [_add_model] \t0.23s\t = Training runtime\n",
      "2020-12-09 22:49:21,324: [_add_model] \t0.0s\t = Validation runtime\n",
      "2020-12-09 22:49:21,340: [_fit] AutoGluon training complete, total runtime = 154.39s ...\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy: 0.839594636093766\n",
      "       pred   label\n",
      "428    >50K   <=50K\n",
      "9664  <=50K   <=50K\n",
      "366    >50K    >50K\n",
      "4642   >50K   <=50K\n",
      "2100  <=50K   <=50K\n"
     ]
    }
   ],
   "source": [
    "custom_hyperparameters = {'RF': {},\n",
    "                          FastTextModel:  {'epoch': 50},\n",
    "                         }\n",
    "\n",
    "feature_generator = AutoMLPipelineFeatureGenerator(enable_raw_text_features=True)\n",
    "\n",
    "predictor = TabularPredictor(label=label).fit(train_data, hyperparameters=custom_hyperparameters, feature_generator=feature_generator)\n",
    "\n",
    "y_pred = predictor.predict(test_data)\n",
    "df_res = pd.DataFrame({\n",
    "    'pred': y_pred,\n",
    "    'label': test_data[label]\n",
    "})\n",
    "print('accuracy:', (df_res.pred.str.strip() == df_res.label.str.strip()).mean())\n",
    "print(df_res.sample(5))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Sentiment Analysis: SST"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The Standford Sentiment Treebank ([SST](https://nlp.stanford.edu/sentiment/)) dataset aims to predict positive or negative sentiment of movie views. Here we show how the model performs on this dataset. First, let's load the data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "/local/home/qiangson/workspace/autosense/code/conda-env-0820/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future. Please pass the result to `transformed_cell` argument and any exception that happen during thetransform in `preprocessing_exc_tuple` in IPython 7.17 and above.\n",
      "  and should_run_async(code)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "[{'sentence': 'the format gets used best ... to capture the dizzying heights achieved by motocross and bmx riders , whose balletic hotdogging occasionally ends in bone-crushing screwups . ',\n",
       "  'label': 1},\n",
       " {'sentence': 'scooby dooby doo / and shaggy too / you both look and sound great . ',\n",
       "  'label': 1},\n",
       " {'sentence': \"though it 's become almost redundant to say so , major kudos go to leigh for actually casting people who look working-class . \",\n",
       "  'label': 1},\n",
       " {'sentence': \"inside the film 's conflict-powered plot there is a decent moral trying to get out , but it 's not that , it 's the tension that keeps you in your seat . \",\n",
       "  'label': 1},\n",
       " {'sentence': \"because of an unnecessary and clumsy last scene , ` swimfan ' left me with a very bad feeling . \",\n",
       "  'label': 0}]"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "train_data = pd.read_parquet('https://autogluon-text.s3.amazonaws.com/glue/sst/train.parquet')\n",
    "test_data = pd.read_parquet('https://autogluon-text.s3.amazonaws.com/glue/sst/dev.parquet')\n",
    "test_data.sample(5).to_dict(orient='records')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we train an AutoGluon model with the FastTextModel as one of the custom model types:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "2020-12-09 22:49:25,596: [setup_outputdir] No path specified. Models will be saved in: AutogluonModels/ag-20201210_064925/\n",
      "2020-12-09 22:49:25,597: [_fit] Beginning AutoGluon training ...\n",
      "2020-12-09 22:49:25,597: [_fit] AutoGluon will save models to AutogluonModels/ag-20201210_064925/\n",
      "2020-12-09 22:49:25,598: [_fit] AutoGluon Version:  0.0.15b20201112\n",
      "2020-12-09 22:49:25,598: [_fit] Train Data Rows:    67349\n",
      "2020-12-09 22:49:25,598: [_fit] Train Data Columns: 1\n",
      "2020-12-09 22:49:25,599: [_fit] Preprocessing data ...\n",
      "2020-12-09 22:49:25,614: [infer_problem_type] AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).\n",
      "2020-12-09 22:49:25,614: [infer_problem_type] \t2 unique label values:  [0, 1]\n",
      "2020-12-09 22:49:25,615: [infer_problem_type] \tIf 'binary' is not the correct problem_type, please manually specify the problem_type argument in fit() (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])\n",
      "2020-12-09 22:49:25,619: [__init__] Selected class <--> label mapping:  class 1 = 1, class 0 = 0\n",
      "2020-12-09 22:49:25,621: [general_data_processing] Using Feature Generators to preprocess the data ...\n",
      "2020-12-09 22:49:25,622: [_log] Fitting AutoMLPipelineFeatureGenerator...\n",
      "2020-12-09 22:49:25,628: [_log] \tAvailable Memory:                    257364.58 MB\n",
      "2020-12-09 22:49:25,628: [_log] \tTrain Data (Original)  Memory Usage: 7.47 MB (0.0% of available memory)\n",
      "2020-12-09 22:49:25,629: [_log] \tInferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.\n",
      "2020-12-09 22:49:25,768: [_log] \tStage 1 Generators:\n",
      "2020-12-09 22:49:25,768: [_log] \t\tFitting AsTypeFeatureGenerator...\n",
      "2020-12-09 22:49:25,772: [_log] \tStage 2 Generators:\n",
      "2020-12-09 22:49:25,773: [_log] \t\tFitting FillNaFeatureGenerator...\n",
      "2020-12-09 22:49:25,778: [_log] \tStage 3 Generators:\n",
      "2020-12-09 22:49:25,779: [_log] \t\tFitting IdentityFeatureGenerator...\n",
      "2020-12-09 22:49:25,781: [_log] \t\t\tFitting RenameFeatureGenerator...\n",
      "2020-12-09 22:49:25,783: [_log] \t\tFitting CategoryFeatureGenerator...\n",
      "2020-12-09 22:49:25,890: [_log] \t\t\tFitting CategoryMemoryMinimizeFeatureGenerator...\n",
      "2020-12-09 22:49:25,929: [_log] \t\tFitting TextSpecialFeatureGenerator...\n",
      "2020-12-09 22:49:29,616: [_log] \t\t\tFitting BinnedFeatureGenerator...\n",
      "2020-12-09 22:49:39,386: [_log] \t\t\tFitting DropDuplicatesFeatureGenerator...\n",
      "2020-12-09 22:49:54,829: [_log] \t\tFitting TextNgramFeatureGenerator...\n",
      "2020-12-09 22:49:54,833: [_log] \t\t\tFitting CountVectorizer for text features: ['sentence']\n",
      "2020-12-09 22:49:56,677: [_log] \t\t\tCountVectorizer fit with vocabulary size = 4036\n",
      "2020-12-09 22:50:01,198: [_log] \tStage 4 Generators:\n",
      "2020-12-09 22:50:01,199: [_log] \t\tFitting DropUniqueFeatureGenerator...\n",
      "2020-12-09 22:50:04,213: [_log] \tTypes of features in original data (raw dtype, special dtypes):\n",
      "2020-12-09 22:50:04,214: [print_feature_metadata_full] \t\t('object', ['text']) : 1 | ['sentence']\n",
      "2020-12-09 22:50:04,327: [_log] \tTypes of features in processed data (raw dtype, special dtypes):\n",
      "2020-12-09 22:50:04,443: [print_feature_metadata_full] \t\t('int', ['binned', 'text_special']) :   30 | ['sentence.char_count', 'sentence.word_count', 'sentence.lower_ratio', 'sentence.digit_ratio', 'sentence.special_ratio', ...]\n",
      "2020-12-09 22:50:04,444: [print_feature_metadata_full] \t\t('int', ['text_ngram'])             : 4037 | ['__nlp__.10', '__nlp__.10 minutes', '__nlp__.13', '__nlp__.20', '__nlp__.2002', ...]\n",
      "2020-12-09 22:50:04,444: [print_feature_metadata_full] \t\t('object', ['text'])                :    1 | ['sentence_raw_text']\n",
      "2020-12-09 22:50:04,444: [_log] \t38.6s = Fit runtime\n",
      "2020-12-09 22:50:04,445: [_log] \t1 features in original data used to generate 4068 features in processed data.\n",
      "2020-12-09 22:50:04,586: [_log] \tTrain Data (Processed) Memory Usage: 281.61 MB (0.1% of available memory)\n",
      "2020-12-09 22:50:04,656: [_fit] Data preprocessing and feature engineering runtime = 39.06s ...\n",
      "2020-12-09 22:50:04,656: [__init__] AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'\n",
      "2020-12-09 22:50:04,656: [__init__] \tTo change this, specify the eval_metric argument of fit()\n",
      "2020-12-09 22:50:04,657: [__init__] AutoGluon will early stop models using evaluation metric: 'accuracy'\n",
      "2020-12-09 22:50:06,324: [get_preset_models] Custom Model Type Detected: <class 'autogluon.tabular.models.fasttext.fasttext_model.FastTextModel'>\n",
      "2020-12-09 22:50:06,332: [_train_and_save] Fitting model: RandomForest ...\n",
      "2020-12-09 22:50:42,566: [_add_model] \t0.8568\t = Validation accuracy score\n",
      "2020-12-09 22:50:42,567: [_add_model] \t35.12s\t = Training runtime\n",
      "2020-12-09 22:50:42,567: [_add_model] \t0.15s\t = Validation runtime\n",
      "2020-12-09 22:50:42,646: [_train_and_save] Fitting model: FastTextModel ...\n",
      "2020-12-09 22:52:56,875: [_add_model] \t0.922\t = Validation accuracy score\n",
      "2020-12-09 22:52:56,876: [_add_model] \t133.95s\t = Training runtime\n",
      "2020-12-09 22:52:56,877: [_add_model] \t0.22s\t = Validation runtime\n",
      "2020-12-09 22:52:59,700: [_train_and_save] Fitting model: WeightedEnsemble_L1 ...\n",
      "2020-12-09 22:52:59,957: [_add_model] \t0.9288\t = Validation accuracy score\n",
      "2020-12-09 22:52:59,958: [_add_model] \t0.24s\t = Training runtime\n",
      "2020-12-09 22:52:59,958: [_add_model] \t0.0s\t = Validation runtime\n",
      "2020-12-09 22:53:00,031: [_fit] AutoGluon training complete, total runtime = 214.43s ...\n"
     ]
    }
   ],
   "source": [
    "custom_hyperparameters = {'RF': {},\n",
    "                         FastTextModel:  {'epoch': 100},\n",
    "                         }\n",
    "\n",
    "feature_generator = AutoMLPipelineFeatureGenerator(enable_raw_text_features=True)\n",
    "\n",
    "label = 'label'\n",
    "predictor = TabularPredictor(label=label).fit(train_data, hyperparameters=custom_hyperparameters, feature_generator=feature_generator)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we use the trained model to make predictions on the test data and check its accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "accuracy: 0.8004587155963303\n"
     ]
    }
   ],
   "source": [
    "y_pred = predictor.predict(test_data)\n",
    "print('accuracy:', accuracy_score(y_pred, test_data[label]))"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.7"
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "calc(100% - 180px)",
    "left": "10px",
    "top": "150px",
    "width": "165px"
   },
   "toc_section_display": true,
   "toc_window_display": true
  },
  "varInspector": {
   "cols": {
    "lenName": 16,
    "lenType": 16,
    "lenVar": 40
   },
   "kernels_config": {
    "python": {
     "delete_cmd_postfix": "",
     "delete_cmd_prefix": "del ",
     "library": "var_list.py",
     "varRefreshCmd": "print(var_dic_list())"
    },
    "r": {
     "delete_cmd_postfix": ") ",
     "delete_cmd_prefix": "rm(",
     "library": "var_list.r",
     "varRefreshCmd": "cat(var_dic_list()) "
    }
   },
   "types_to_exclude": [
    "module",
    "function",
    "builtin_function_or_method",
    "instance",
    "_Feature"
   ],
   "window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
